Search
76 items
-
An Italian-Serbian Sentence Aligned Parallel Literary Corpus
This article presents the construction and relevance of an Italian-Serbian sentence-aligned parallel corpus, delving into the aligned sentences in order to facilitate effective translation between the two languages. The parallel corpus serves as a valuable resource for language experts, researchers, and language enthusiasts, fostering a deeper understanding of linguistic nuances and cultural expressions. By bridging the gap between Serbian and Italian, this corpus opens new avenues for cross-cultural communication and collaboration, and ultimately contributes to the improvement of language-related ... Saša Moderc, Ranka Stanković, Aleksandra Tomašević, Mihailo Škorić. "An Italian-Serbian Sentence Aligned Parallel Literary Corpus" in Review of the National Center for Digitization, Belgrade : Faculty of Mathematics, University of Belgrade (2023)
-
Using English Baits to Catch Serbian Multi-Word Terminology
In this paper we present the first results in bilingual terminology extraction. The hypothesis of our approach is that if domain terminology exists for a source language, along with an aligned domain corpus for the source and target languages, then it is possible to extract the terminology for the target language. Our approach relies on several resources and tools: aligned domain texts, domain terminology for a source language, a terminology extractor for a target language, and a ... Keywords: aligned texts, word alignment, terminology extraction, electronic dictionaries, morphological inflection ... from the target part of the aligned corpus having some expected syntactic structure. We will denote an entry from this list with T(term.extract). 2. Processing: • Aligning bilingual chunks (possible translation equivalents) from the aligned corpus. We will denote aligned chunks with S(align.chunk) ...
... source language domain terminology exists as well as a domain aligned corpus for a source and a target language, then it is possible to extract the terminology for a target language. Our approach relies on several resources and tools: aligned domain texts, domain terminology for a source language, a t ...
... developed. The overall design of our system (Figure 1) is as follows: 1. Input: • A sentence-aligned domain-specific corpus involving a source and a target language. We will denote an entry in this corpus with S(text.align) ↔ T(text.align); • A list of terms from the same domain in a source language ... Cvetana Krstev, Branislava Šandrih, Ranka Stanković. "Using English Baits to Catch Serbian Multi-Word Terminology" in Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, European Language Resources Association (ELRA) (2018)
-
E-Connecting Balkan Languages
In this paper we present a versatile language processing tool that can be successfully used for many Balkan languages. This tool relies on several sophisticated textual and lexical resources that were developed for most Balkan languages. These resources are based on several de facto standards in natural language processing. ... l’Institut Gaspard-Monge, CNRS, 2005. [4] T. Erjavec and N. Ide. The MULTEXT-East Corpus. In LREC’98, Granada, pp. 971-974, 1998. [5] A. Gelbukh, G. Sidorov, J.-A. Vera-Félix. A Bilingual Corpus of Novels Aligned at Paragraph Level. In proc. FinTAL-2006. Lecture Notes in Artificial Intelligence ...
... pp. 14-20, 2008. [18] R. Steinberger, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufiş. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th LREC Conference, Genoa, Italy, 22-28 May, 2006, pp.2142-2147, 2006. [19] M. Tran, D. Maurel ...
... американски щати are connected automatically. 3. Using WS4LR with Aligned Texts The WS4LR module that works with aligned texts expects them to be in Translation Memory eXchange (TMX) format. It can also transform texts previously aligned by XAlign into that format, as well as into several other formats: ... Cvetana Krstev, Ranka Stanković, Duško Vitas, Svetla Koeva. "E-Connecting Balkan Languages" in Proceedings of the Workshop on Multilingual resources, technologies and evaluation for Central and Eastern European Languages, 17 September 2009, eds. C. Vertan, S. Piperidis, E. Paskaleva and Milena Slavcheva, Borovets, Bulgaria : Association for Computational Linguistics Stroudsburg, PA, USA (2009)
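The TMX format mentioned in the snippet above stores each aligned unit as a <tu> element containing one <tuv> per language. The following is a minimal sketch of reading such pairs with the Python standard library. It is not WS4LR code; the sample document, the sentence pair, and the function name read_aligned_pairs are invented for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this namespaced key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample document; real TMX files produced by an aligner will
# carry different header values and many more translation units.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" o-tmf="none" creationtool="example"
          creationtoolversion="0"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="sr"><seg>Dobro jutro.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_aligned_pairs(tmx_text, src="en", tgt="sr"):
    """Return (source, target) segment pairs from a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        # Map each language code in this translation unit to its segment text.
        segs = {
            tuv.get(XML_LANG): (tuv.findtext("seg") or "")
            for tuv in tu.iter("tuv")
        }
        # Keep only units that contain both requested languages.
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


PAIRS = read_aligned_pairs(SAMPLE_TMX)
```

Units that lack either language are skipped rather than padded, which keeps the output strictly pairwise.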
-
A Mathematical Learning Environment Based on Serbian Language Resources
In recent years, in line with the ever-growing use of information technology, learning environments have been changing. The amount of available learning materials in various forms has increased. These new environments demand comprehensive learning systems, which enable management of the learning corpus with special attention paid to relevant lexical resources. In this paper we present the concept of a Mathematical Learning Environment in Serbian (MLES), which is based on a corpus of mathematical materials and various lexical resources, enabling ... is dedicated to corpus processing and alignment with existing lexical resources (Figure 1). Figure 1. Structure of corpus processing The obtained results are processed text, augmented dictionaries and annotated content. In this component a special challenge to corpus processing results ...
... between tokens). Through CQPweb, users can both specify query patterns and get statistical information about the corpus. During preparation of the MLES corpus, each word in the corpus is assigned the following information, in the following order: Word type (noun, verb, adjective, etc.) - ...
... annotation. The advantages of corpus annotation are the following: Corpus search becomes more efficient due to the possibility of formulating more precise queries. Where search results are concerned, annotation compensates for the information lost during corpus preparation (removed figures ... Radojičić Marija, Obradović Ivan, Stanković Ranka, Utvić Miloš, Kaplar Sebastijan. "A Mathematical Learning Environment Based on Serbian Language Resources" in Proceedings of the 7th International Scientific Conference Technics and Informatics in Education, Faculty of Technical Sciences, Čačak (2018)
-
Vebran Web Services for Corpus Query Expansion
Ranka Stanković, Miloš Utvić (2020) This paper discusses the development of the Vebran web services and their application in improving corpus search. The Vebran web services are used to consult external lexical resources for Serbian (mainly electronic morphological dictionaries and the Serbian WordNet) and to expand user queries in order to obtain more relevant results from Serbian corpora. ... with documents on finance, health, law and education; – SrpFranKor, aligned French-Serbian corpus; – SrpNemKor, aligned German-Serbian corpus; – RudKor, a specialized monolingual corpus of texts from the mining domain, etc. The query expansion will be demonstrated using examples in two Serbian ...
... Language Resources and Technologies Society (JeRTeh): – monolingual general corpora: Corpus of Contemporary Serbian (versions SrpKor2003 and SrpKor2013) and its subset SrpLemKor; – SrpEngKor, aligned English-Serbian corpus including subcorpus SELFEH (Serbian-English Law Finance Education and Health) with ...
... all levels of the text structure (section, title, paragraph, sentence) are annotated in some particular corpus texts, especially those which are part of aligned corpora. The SrpKor2013 corpus is used by more than 700 users, mostly Slavists. 2.2 RudKor Systematic collection and preparation of texts ... Ranka Stanković, Miloš Utvić. "Vebran Web Services for Corpus Query Expansion" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.5
-
Towards translation of educational resources using GIZA++
... equivalents, is presented, from aligned texts in SELFEH, a Serbian-English corpus of texts related to education, finance, health and law, aligned at the sentence level within the Intera project. The corpus was lemmatized and the method applied on lemmas of word forms from the corpus, by extracting candidate ...
... with the training pipeline, and obviously mis-aligned sentences are also removed. Tokenisation launch was initialized by the following sequence: ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \ < ~/corpus/training/edX.en \ > ~/corpus/edX.tok.en ~/mosesdecoder/scripts/tokeni ...
... case was 16GB. Corpus Preparation For our research we used five text collections, three of them being scientific journals and two resources produced within international projects. Total number of documents is 299 in English and the same number in Serbian, while the total of aligned sentences is ... Ivan Obradović, Dalibor Vorkapić, Ranka Stanković, Nikola Vulović, Miladin Kotorčević. "Towards translation of educational resources using GIZA++" in The Seventh International Conference on e-Learning (eLearning-2016), September 2016, Belgrade : Metropolitan University (2016)
-
Managing mining project documentation using human language technology
Purpose: This paper aims to develop a system that enables efficient management and exploitation of documentation in electronic form related to mining projects, with information retrieval and information extraction (IE) features, using various language resources and natural language processing. Design/methodology/approach: The system is designed to integrate textual, lexical, semantic and terminological resources, enabling advanced document search and extraction of information. These resources are integrated with a set of Web services and applications, for different user profiles and use-cases. Findings: The ... Keywords: Digital libraries, Information retrieval, Data mining, Human language technologies, Project documentation. Aleksandra Tomašević, Ranka Stanković, Miloš Utvić, Ivan Obradović, Božo Kolonja. "Managing mining project documentation using human language technology" in The Electronic Library (2018). https://doi.org/10.1108/EL-11-2017-0239
-
Football terminology: compilation and transformation into OntoLex-Lemon resource
This paper presents a project in development, the creation of the first digital football dictionary in Serbian, as well as a demonstration of the application of the OntoLex model and its modules. The OntoLex-FrAC module includes information on frequency and usage examples extracted from a corpus. In this case, a domain-specific corpus named SrFudKo (СрФудКо) was created, containing football news articles in Serbian. Multi-word terms were automatically extracted from the Serbian corpus and then manually evaluated and classified as sports-related or ... Jelena Lazarević, Ranka Stanković, Mihailo Škorić, Biljana Rujević. "Football terminology: compilation and transformation into OntoLex-Lemon resource" in LDK 2023 – 4th Conference on Language, Data and Knowledge, 12-15 September in Vienna, Austria, Lisbon : NOVA FCSH - CLUNL (2023). https://doi.org/10.34619/srmk-injj
-
A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian
Abusive speech on social media, including profanity, derogatory speech and hate speech, has reached pandemic levels. A system able to detect such texts could help make the Internet and social media a better and more respectful virtual space. Research and commercial applications in this area have so far focused mainly on English. This paper presents work on building AbCoSER, the first corpus of abusive speech in Serbian. The corpus consists of 6,436 manually annotated ... should we take when building the corpus of abusive language, several future implications were considered: 1) To the best of our knowledge, AbCoSER (Abusive Corpus for Serbian) is the first corpus tackling abusive language phenomenon in the Serbian language; 2) This corpus is to be used to enrich our lexicon ...
... n of tweets length. Figure 5 Tree clouds of abusive language corpus: the non-abusive subset (left) and abusive subset (right). 3 A Corpus and Lexicon for Abusive Speech 3.1 Analysis of the twitter corpus The distribution of our corpus tweets length after removal of mentions is shown in Figure 4: ...
... from the abusive tweet corpus) and for neutral language (derived from the corpus of non-abusive tweets), which enables calculation of the so-called keyness score, which should represent the extent of the frequency difference. These frequencies can also be compared with the corpus of standard Serbian (as ...Danka Jokić, Ranka Stanković, Cvetana Krstev, Branislava Šandrih. "A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian" in 3rd Conference on Language, Data and Knowledge (LDK 2021), MDPI AG (2021). https://doi.org/10.4230/OASIcs.LDK.2021.13
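The snippet above describes a keyness score that reflects how strongly a word's frequency differs between two corpora. The snippet does not say which formula the authors use, so the sketch below assumes the commonly used log-likelihood statistic. The function name and any counts passed to it are hypothetical and are not taken from AbCoSER.

```python
import math


def log_likelihood_keyness(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness of one word: target corpus vs. reference corpus.

    freq_* are the word's raw counts; size_* are total token counts.
    This is an assumed measure, not necessarily the one used in the paper.
    """
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Expected counts if the word were equally frequent in both corpora.
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    score = 0.0
    for observed, expected in (
        (freq_target, expected_target),
        (freq_ref, expected_ref),
    ):
        # A zero count contributes nothing (limit of x * log x as x -> 0).
        if observed > 0:
            score += observed * math.log(observed / expected)
    return 2.0 * score
```

A word with the same relative frequency in both corpora scores zero, and the score grows as the relative frequencies diverge. The statistic itself is unsigned, so comparing the two relative frequencies is still needed to tell which corpus the word is typical of.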
-
Bilingual lexical extraction based on word alignment for improving corpus search
Jelena Andonovski, Branislava Šandrih, Olivera Kitanović. "Bilingual lexical extraction based on word alignment for improving corpus search" in The Electronic Library, Emerald (2019). https://doi.org/10.1108/EL-03-2019-0056
-
Towards Automatic Definition Extraction for Serbian
The paper presents preliminary results of automatic extraction of candidates for dictionary definitions from unstructured texts in Serbian, with the aim of speeding up dictionary development. Definitions in the dictionary of the Serbian Academy of Sciences and Arts (SASA) were used to model different types of definitions (descriptive, grammatical, referential and synonym-based) that have different syntactic and lexical features. The research corpus consists of 61,213 definitions of nouns, which were analysed using morphological e-dictionaries and local grammars implemented as finite-state transducers in an open-source corpus processing suite ... enabling the connection are given in italic. 4 Corpus Analysis 4.1 Creating a Textbook Corpus For testing the automatic extraction of definitions from unstructured text by means of local grammars presented in the previous section, we created a corpus of 25 primary and secondary school textbooks covering ...
... 3 4.2 Recognition of Candidates in the Textbook Corpus To determine whether it is possible to recognize definitions of domain-specific terms in the domain corpus text, a subset of local grammars presented in Section 3 was applied to the corpus consisting of 25 textbooks; namely we excluded models ...
... different syntactic and lexical features. The research corpus consists of 61,213 definitions of nouns, which were analysed using Serbian morphological e-dictionaries and local grammars implemented as finite state transducers in an open-source corpus processing suite Unitex. The 21 models developed up ... Ranka Stanković, Cvetana Krstev, Rada Stijović, Mirjana Gočanin, Mihailo Škorić. "Towards Automatic Definition Extraction for Serbian" in Proceedings of the XIX EURALEX Congress of the European Association for Lexicography: Lexicography for Inclusion (Volume 2). 7-9 September (virtual), Democritus University of Thrace (2021)
-
Two approaches to compilation of bilingual multi-word terminology lists from lexical resources
In this paper, we present two approaches and the implemented system for bilingual terminology extraction that rely on an aligned bilingual domain corpus, a terminology extractor for a target language, and a tool for chunk alignment. The two approaches differ in the way terminology for the source language is obtained: the first relies on an existing domain terminology lexicon, while the second one uses a term extraction tool. For both approaches, four experiments were performed with two parameters being ...Branislava Šandrih, Cvetana Krstev, Ranka Stanković. "Two approaches to compilation of bilingual multi-word terminology lists from lexical resources" in Natural Language Engineering, Cambridge University Press (CUP) (2020). https://doi.org/10.1017/S1351324919000615
-
Rule-based Automatic Multi-word Term Extraction and Lemmatization
In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of multi-word terms. The same technology is used for lemmatization of extracted multi-word terms, which is unavoidable for highly inflected languages in order to pass extracted data to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from ...... within the domain corpus, whereas the remaining two measures compare term frequency in the domain corpus and the general language corpus, thus illustrating how specific the MWU is for the selected domain. As the general corpus we used a 22 million words excerpt from the Corpus of Contemporary ...
... For evaluation we used a corpus that contains 10 textbooks, 2 projects and 51 journal articles from the mining domain. The size of this corpus is 32,633 sentences and 625,105 simple word forms. For calculation of measures that compare results on a domain corpus with general language we used ...
... Language Using Rules Automatically Generated Based on the Corpus Analysis. In Proc. of 7th Language & Technology Conference 2015, Poznań: Fundacja Uniwersytetu im. A. Mickiewicza, pp. 540-544. Pantel, P., Dekang L. (2001). A statistical corpus-based term extractor. In Stroulia, E., Matwin, S. (Eds ... Ranka Stanković, Cvetana Krstev, Ivan Obradović, Biljana Lazić, Aleksandra Trtovac. "Rule-based Automatic Multi-word Term Extraction and Lemmatization" in Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, 23-28 May 2016, European Language Resources Association (2016)
-
Corpus-based bilingual terminology extraction in the power engineering domain
This paper presents the resources and tools used for the extraction and evaluation of bilingual, English-Serbian terminology in the power engineering domain. The resources consist of existing general and domain lexica and a domain parallel corpus; the tools include term extractors for both languages and a tool for aligning segments that belong to corpus sentences. The system was tested by varying the matching function that determines the presence of an extracted term in an aligned segment (chunk), ranging from very loose to strict. Evaluation of the results showed that the precision of term extraction ... Tanja Ivanović, Ranka Stanković, Branislava Šandrih Todorović, Cvetana Krstev. "Corpus-based bilingual terminology extraction in the power engineering domain" in Terminology, John Benjamins Publishing Company (2022). https://doi.org/10.1075/term.20038.iva
-
A Data Driven Approach for Raw Material Terminology
Olivera Kitanović, Ranka Stanković, Aleksandra Tomašević, Mihailo Škorić, Ivan Babić, Ljiljana Kolonja (2021) The research presented in this paper aims at creating a bilingual (sr-en), easily searchable, hypertext, born-digital, corpus-based terminological database of raw material terminology for dictionary production. The approach is based on linking dictionaries related to the raw material domain, both digitally born and printed, into a lexicon structure, aligning terminology from different dictionaries as much as possible. This paper presents the main features of this approach, data used for compilation of the terminological database, the procedure by which it has ... Keywords: raw materials, mining, terminology, dictionary, terminological application, mobile application, digitization, lexical data, corpora, linked open data ... (32%). The bilingual corpus of texts aligned on the sentence level was produced from the bilingual digital library Bibliša. The initial set of 55 documents containing 4831 aligned Serbian-English sentences [29] was enlarged with 44 new documents containing 12,657 aligned sentences from the raw material ...
... generated from two sources, namely, by retrieval from the bilingual MD and by extraction from the aligned bilingual corpus. Term entries from MD were parsed and only those that were confirmed by the mining corpus (monolingual or bilingual) were selected. As mentioned before, one term entry can comprise more ...
... characteristics of examples from this corpus is compared with the characteristics of the distribution of the sample sentences extracted from the corpus that contains different texts. The approach was adapted to work also for English and to be applied for bilingual aligned sentences. For ranking, we have used ...Olivera Kitanović, Ranka Stanković, Aleksandra Tomašević, Mihailo Škorić, Ivan Babić, Ljiljana Kolonja. "A Data Driven Approach for Raw Material Terminology" in Applied Sciences, MDPI AG (2021). https://doi.org/10.3390/app11072892
-
FrameNet Lexical Database: Presenting a Few Frames Within the Risk Domain
The paper gives a brief overview of the theory of frame semantics, on which the FrameNet lexical database is based. The concept of this network is presented, as well as the possibilities for its application. The lexical analysis applied in the FrameNet project is also presented, and the differences between frame-based analysis and word-based analysis are pointed out. Several related frames evoked by words from the risk domain are then shown. The paper also presents the NLTK platform, which makes it possible to use ... (Natural Language Toolkit) suite, which provides a good natural language processing resource. The last chapter shows a corpus search of the noun risk in a mining-themed corpus. We also present its most common collocates, word sketch, individual pattern concordances, thesaurus entry of its synonyms ...
... analysis of the meaning of an LU, its lexical surroundings, phrases and grammatical constructions in which it appears in the corpus, the context in which it is used provided by corpus examples, as well as all the phrases in which the LU fulfills its full semantic potential. This approach consists of listing ...
... 2016, 21) and its use is looked into by extracting sentences, which contain it, from the corpus. A lexicographer working in FrameNet compares his or her insight into the meaning of a target lexeme, based on corpus examples, to the meaning given in descriptive dictionaries. Once he gets a clearer idea ... Aleksandra Marković, Ranka Stanković, Natalija Tomić, Olivera Kitanović. "FrameNet Lexical Database: Presenting a Few Frames Within the Risk Domain" in Infotheca, Faculty of Philology, University of Belgrade (2021). https://doi.org/10.18485/infotheca.2021.21.1.1
-
Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection
Ranka Stanković, Cvetana Krstev, Branislava Šandrih Todorović, Duško Vitas, Mihailo Škorić, Milica Ikonić Nešić (2022) In this paper we present the Serbian part of the ELTeC multilingual corpus of novels written in the time period 1840-1920. The corpus is being built in order to test various distant reading methods and tools with the aim of re-thinking the European literary history. We present the various steps that led to the production of the Serbian sub-collection: the novel selection and retrieval, text preparation, structural annotation, POS-tagging, lemmatization and named entity recognition. The Serbian sub-collection was published ... Ranka Stanković, Cvetana Krstev, Branislava Šandrih Todorović, Duško Vitas, Mihailo Škorić, Milica Ikonić Nešić. "Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection" in Proceedings of the Language Resources and Evaluation Conference, June 2022, Marseille, France, European Language Resources Association (2022)
-
Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian
The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on TreeTagger and spaCy taggers, and the annotation schema alignment ... MULTEXT-East and Universal Part-of-Speech tagset. The trained models will be used to publish the new version of the Corpus of Contemporary Serbian as well as the Serbian literary corpus. The performance of the developed taggers was compared and the impact of training set size was investigated, which resulted ...
... automate the process of annotation schema harmonization and preparation of training datasets as much as possible. The corpus of training set texts needed checking and correction. Corpus correction is a time consuming process, which requires a lot of manual intervention and help of linguistic specialists ...
... the multilingual corpus prepared in the scope of the project “Integrated European language data Repository Area” (Gavrilidou et al., 2006). It contains texts from law, health and education domains. Švejk, Floods, History are three short 1 Unitex/GramLab: Cross Platform Corpus Processing Suite, ... Ranka Stanković, Branislava Šandrih, Cvetana Krstev, Miloš Utvić, Mihailo Škorić. "Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian" in Proceedings of the 12th Language Resources and Evaluation Conference, May 2020, Marseille, France, European Language Resources Association (2020)
-
SASA Dictionary as the Gold Standard for Good Dictionary Examples for Serbian
Ranka Stanković, Branislava Šandrih, Rada Stijović, Cvetana Krstev, Duško Vitas, Aleksandra Marković (2019) In this paper we present a model for selecting good examples for a dictionary of Serbian and the development of the model's initial components. The method used is based on a thorough analysis of various lexical and syntactic features in a corpus compiled of examples from the five digitized volumes of the SASA dictionary. The initial set of features was inspired by a similar approach for other languages. The feature distribution of examples from this corpus is compared with the feature distribution of sample sentences excerpted from corpora containing various texts. The analysis showed that ... Keywords: Serbian, good dictionary examples, automation of dictionary production, feature extraction, machine learning ... supported by the corpus data. It is common for lexicographers to look for examples in the corpus of contemporary Serbian (SrpKor, developed by D. Vitas and a group of collaborators from University of Belgrade, http://www.korpus.matf.bg.ac.rs/korpus/), which is being used as a control corpus, but they rarely ...
... to the corpus made of examples, we prepared a control dataset derived from various texts, which was used as a sample corpus for dictionary example extraction. The control dataset of example candidates was obtained from the digital library Biblisha (Stanković et al., 2017), SrpKor – the corpus of c ...
... syntactic features in a corpus compiled of examples from the five digitized volumes of the Serbian Academy of Sciences and Arts (SASA) dictionary. The initial set of features was inspired by a similar approach for other languages. The feature distribution of examples from this corpus is compared with the ... Ranka Stanković, Branislava Šandrih, Rada Stijović, Cvetana Krstev, Duško Vitas, Aleksandra Marković. "SASA Dictionary as the Gold Standard for Good Dictionary Examples for Serbian" in Electronic lexicography in the 21st century. Proceedings of the eLex 2019 conference, Lexical Computing CZ, s.r.o. (2019)
-
Extraction of Bilingual Terminology Using Graphs, Dictionaries and GIZA++
Branislava Šandrih, Ranka Stanković (2020) In science, industry and many research fields, terminology is developing rapidly. Most often, the language that serves as the lingua franca for most of these fields is English. As a consequence, for many fields domain terms are conceived in English and later translated into other languages. In this paper we present an approach for automatic extraction of bilingual terminology for the English-Serbian language pair that relies on an aligned bilingual domain corpus, a terminology extractor for the target language and a tool for chunk alignment. We examine the performance of the method on the domain ... obtaining 8 different experimental settings: 1. The input domain aligned corpus (Input i) consists of: (a) the aligned corpus LIS-corpus; (b) the aligned corpus LIS-corpus extended with the bilingual aligned pairs bi-list (LIS-corpus+); 2. The list of domain terms for the source language (Input ii) ...
... sentence-aligned domain-specific corpus involving a source and a target language, denoted as S(text.align) ↔ T (text.align). In this paper we refer to this tool as LIS-corpus. As a textual resource, twelve issues with a total of 84 papers were aligned at the sentence level resulting in 14,710 aligned segments ...
... applied to the source language part of the aligned input corpus; 3. The extraction of the set of MWTs in the target language by Serb-TE (Input iii) was done: (a) on the target language part of the aligned chunks (chunk); (b) on the target language part of the aligned input sentences (text). Infotheca Vol ...Branislava Šandrih, Ranka Stanković. "Extraction of Bilingual Terminology Using Graphs, Dictionaries and GIZA++" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.6